1. Introduction


This  work  was  performed  for  the  DEMO  III  Unmanned  Ground  Vehicle  (UGV)  program,  which 
is  developing  UGVs  that  will  assist  U.S.  Army  scouts.  The  Electro-Optics  Infrared  (EOIR) 
Image  Processing  branch  (AMSRL-SE-SE)  has  been  tasked  with  developing  algorithms  for 
acquiring  and  recognizing  targets  imaged  by  the  Wescam  Forward-Looking  Infrared  (FLIR) 
sensor.  These  images  are  sent  back  to  the  user  upon  request  or  when  the  automatic  target 
recognizer  (ATR)  indicates  a  location  of  interest.  The  user  makes  the  ultimate  decision  about 
whether an object in an image is actually a target. The ATR reduces the bandwidth requirement
of the communication link because the imagery can be sent back at reduced resolution, except
in those regions indicated by the ATR as possible targets. The algorithms consist of a front-end
detector, a clutter rejector, and a recognizer. The next three sections describe these
components. 


2.  The  Detection  Algorithm 


The algorithm described in this report was designed to address the need for a widely applicable
detection algorithm that could serve as a prescreener/detector for a number of applications.
Most automatic target detection/recognition (ATD/R) algorithms use a great deal of problem-specific
knowledge to improve performance, but the result is an algorithm that is tailored to specific
target types and poses. The approximate range to target is often required, with varying amounts
of  tolerance.  For  example,  in  some  scenarios,  it  is  assumed  that  the  range  is  known  to  within  a 
meter from a laser range finder or a digital map. In other scenarios, only the range to the center
of the field-of-view and the depression angle are known, so that a flat earth approximation
provides  the  best  estimate.  Many  algorithms,  both  model-based  and  learning-based,  either 
require  accurate  range  information  or  compensate  for  inaccurate  information  by  attempting  to 
detect  targets  at  a  number  of  different  ranges  within  the  tolerance  of  the  range.  Because  many 
such  algorithms  are  quite  sensitive  to  scale,  even  a  modest  range  tolerance  requires  that  the 
algorithm  iterate  through  a  large  number  of  closely  spaced  scales,  driving  up  both  the 
computational  complexity  and  the  false  alarm  rate.  Algorithms  have  often  used  statistical 
methods  [1]  or  view-based  neural  networks  [2,  3, 4]. 

The  proximate  motivation  for  the  development  of  the  scale-insensitive  algorithm  was  to  provide 
a  fast  prescreener  for  a  robotic  application  for  which  no  range  information  was  available.  The 
algorithm instead attempted to find targets at all ranges between some reasonable minimum,
determined from operational requirements, and the maximum effective range of the sensor.

Another  motivation  was  to  develop  an  algorithm  that  could  be  applied  to  a  wide  variety  of  image 
sets  and  sensor  types,  which  required  it  to  perform  consistently  on  new  data,  without  the  severe 
degradation  in  performance  that  commonly  occurs  with  learning  algorithms,  such  as  neural 
networks  and  principal  component  analysis  (PCA)-based  methods,  that  have  been  trained  on  a 
limited  variety  of  sensor  types,  terrain  types,  and  environmental  conditions.  While  we  recognize 
that with a suitable training set, learning algorithms will often perform better than other methods,
this  typically  requires  a  large  and  expensive  training  set,  which  is  sometimes  not  feasible. 

2.1  The  Data 

The dataset used in training and testing this system was the April 1992 Comanche FLIR
collection at Fort Hunter-Liggett, CA. This dataset consists of 1225 images, each 720 by 480
pixels. Each image has a field of view of approximately 1.75° square.

Each  of  the  images  contains  one  or  two  targets  in  a  hilly,  wooded  background.  Ground  truth  was 
available, which provided target centroid, range-to-target, target type, target aspect,
range-to-center of field-of-view, and the depression angle. The target centroid and range-to-target were
used to score the algorithm, as described in the experimental results section, but none of the
target-specific information was used in the testing process. The algorithm assumes only that the
vertical and horizontal fields of view and the numbers of pixels horizontally and vertically are
known. The only range information used is the operational minimum range and the maximum
effective  range  of  the  sensor. 

2.2  The  Features 

Each  of  the  features  is  calculated  for  every  pixel  in  the  image.  As  more  complex  features  are 
added  in  the  future,  it  might  become  beneficial  to  calculate  some  of  the  features  only  at  those 
locations  for  which  the  other  feature  values  are  high.  While  each  of  the  features  assumes 
knowledge  of  the  range  to  determine  approximate  target  size,  these  features  are  not  highly  range 
sensitive.  The  algorithm  calculates  each  of  these  features  at  coarsely  sampled  ranges  between  the 
minimum  and  maximum  allowed  range.  The  features  are  described  below. 
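As a rough illustration of how a range hypothesis translates into a target size in pixels, the following sketch uses the 720-pixel image width and 1.75° field of view from Section 2.1 and the 7 m vehicle length from Section 2.2.1. The geometric spacing of the sampled ranges, the function names, and the range limits in the usage loop are assumptions made for illustration; the report does not state how the coarse ranges are chosen.

```python
import math

def target_size_pixels(range_m, target_m=7.0, fov_deg=1.75, n_pixels=720):
    """Approximate extent, in pixels, of an object target_m metres long at range_m metres."""
    rad_per_pixel = math.radians(fov_deg) / n_pixels   # angular size of one pixel
    return (target_m / range_m) / rad_per_pixel        # small-angle approximation

def coarse_ranges(r_min, r_max, ratio=1.5):
    """Ranges from r_min to r_max, each `ratio` times the previous one (assumed spacing)."""
    ranges = []
    r = float(r_min)
    while r < r_max:
        ranges.append(r)
        r *= ratio
    ranges.append(float(r_max))
    return ranges

# Hypothetical minimum and maximum ranges, in metres.
for r in coarse_ranges(500, 4000, ratio=2.0):
    print(f"{r:6.0f} m -> {target_size_pixels(r):6.1f} px wide")
```

Because the features are not highly range sensitive, a handful of such range hypotheses may be enough to cover the whole interval, which is what keeps the computation and the false alarm rate down.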

Each  of  the  features  was  chosen  based  on  intuition,  with  the  criteria  that  they  be  monotonic  and 
computationally  simple.  The  features  are  described  in  decreasing  order  of  importance. 


2.2.1 Maximum Grey Level-Feature 0

The  maximum  grey  level  is  the  highest  grey  level  within  a  roughly  target-sized  rectangle 
centered  on  the  pixel.  It  was  chosen  because  in  many  FLIR  images  of  vehicles,  there  are  a  few 
pixels  that  are  significantly  hotter  than  the  rest  of  the  target  or  the  background.  These  pixels  are 
usually  on  the  engine,  the  exhaust  manifold,  or  the  exhaust  pipe.  The  feature  is  defined  as 


$$F^0_{i,j} = \max_{(k,l) \in N_{in}(i,j)} f(k,l), \tag{1}$$

where $f(k,l)$ is the grey level value of the pixel in the $k$th row and $l$th column, and $N_{in}(i,j)$ is the
neighborhood of the pixel $(i,j)$, defined as a rectangle whose width is the length of the longest
vehicle in the target set and whose height is the height of the tallest vehicle in the target set. For
the applications that we have considered, the width is 7 m and the height is 3 m.
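A minimal sketch of equation (1) follows, assuming the image is a list of rows of grey levels and that the target-sized rectangle has already been converted to half-height and half-width in pixels (`half_h` and `half_w` are hypothetical parameter names). Clipping the window at the image border is an assumption; the report does not say how borders are handled.

```python
def max_grey_level(img, half_h, half_w):
    """Feature 0: maximum grey level in the window centered on each pixel.

    The window spans (2*half_h + 1) rows by (2*half_w + 1) columns and is
    clipped where it would extend past the image border.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            r0, r1 = max(0, i - half_h), min(rows, i + half_h + 1)
            c0, c1 = max(0, j - half_w), min(cols, j + half_w + 1)
            out[i][j] = max(img[k][l] for k in range(r0, r1) for l in range(c0, c1))
    return out
```

This direct form costs one window scan per pixel; a production implementation would more likely use a separable running maximum, but the result is the same.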


2.2.2  Contrastbox-Feature  1 

The  contrastbox  feature  measures  the  average  grey  level  over  a  target-sized  region  and  compares 
it  to  the  grey  level  of  the  local  background.  It  was  chosen  because  many  pixels  that  are  not  on  the 
engine  or  on  other  particularly  hot  portions  of  the  target  are  still  somewhat  warmer  than  the 
natural background. This feature has been used by a large number of authors. The feature is
defined as

$$F^1_{i,j} = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} f(k,l) - \frac{1}{n_{out}} \sum_{(k,l) \in N_{out}(i,j)} f(k,l), \tag{2}$$

where $n_{out}$ is the number of pixels in $N_{out}(i,j)$, $n_{in}$ is the number of pixels in $N_{in}(i,j)$, $N_{in}(i,j)$ is the
target-sized neighborhood defined above, and the neighborhood $N_{out}(i,j)$ contains all of the pixels
in a larger rectangle around $(i,j)$, except those pixels that are in $N_{in}(i,j)$.
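The contrastbox of equation (2) can be sketched for a single pixel as below. The width of the background frame (`margin`, in pixels) is a hypothetical parameter, since the report only says that the outer rectangle is larger; border clipping is likewise an assumption.

```python
def box_sum(img, r0, r1, c0, c1):
    """Sum and pixel count of img over rows [r0, r1) and columns [c0, c1), clipped to the image."""
    r0, c0 = max(0, r0), max(0, c0)
    r1, c1 = min(len(img), r1), min(len(img[0]), c1)
    vals = [img[k][l] for k in range(r0, r1) for l in range(c0, c1)]
    return sum(vals), len(vals)

def contrastbox(img, i, j, half_h, half_w, margin):
    """Feature 1 at pixel (i, j): mean of the inner box minus mean of the surrounding frame."""
    s_in, n_in = box_sum(img, i - half_h, i + half_h + 1, j - half_w, j + half_w + 1)
    s_big, n_big = box_sum(img, i - half_h - margin, i + half_h + margin + 1,
                           j - half_w - margin, j + half_w + margin + 1)
    # The frame N_out is the larger rectangle with the inner box removed.
    s_out, n_out = s_big - s_in, n_big - n_in
    if n_out == 0:
        raise ValueError("margin must be at least 1 so that the background frame is not empty")
    return s_in / n_in - s_out / n_out
```

For an image that is uniformly 10 except for a single pixel of value 30, `contrastbox` at that pixel with a one-pixel inner box and a one-pixel margin gives 30 - 10 = 20.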

2.2.3  Average  Gradient  Strength-Feature  2 

The  gradient  strength  feature  was  chosen  because  manmade  objects  tend  to  show  sharper  internal 
detail  than  natural  objects,  even  when  the  average  intensity  is  similar.  To  prevent  large  regions  of 
background  that  show  higher  than  normal  variation  from  showing  a  high  value  for  this  feature, 
the  average  gradient  strength  of  the  local  background  is  subtracted  from  the  average  gradient 
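strength of the target-sized region. The feature is calculated as

$$F^2_{i,j} = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} G_{in}(k,l) - \frac{1}{n_{out}} \sum_{(k,l) \in N_{out}(i,j)} G_{out}(k,l), \tag{3}$$

where

$$G_{in}(i,j) = G^h_{in}(i,j) + G^v_{in}(i,j), \tag{4}$$

$$G^h_{in}(i,j) = |f(i,j) - f(i,j+1)|, \tag{5}$$

$$G^v_{in}(i,j) = |f(i,j) - f(i+1,j)|, \tag{6}$$

and $G_{out}(i,j)$ is defined similarly.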
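The per-pixel gradient strength of equations (4) through (6) can be sketched as follows. Treating a difference as zero when the neighboring pixel lies off the image is an assumption. The inside-minus-outside averaging of equation (3) then has the same form as the contrastbox of equation (2), applied to this gradient image instead of the grey levels.

```python
def gradient_image(img):
    """Gradient strength at each pixel: absolute difference to the right-hand
    neighbor plus absolute difference to the neighbor below."""
    rows, cols = len(img), len(img[0])
    g = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # A difference is taken as 0 when the neighbor is off the image.
            across = abs(img[i][j] - img[i][j + 1]) if j + 1 < cols else 0
            down = abs(img[i][j] - img[i + 1][j]) if i + 1 < rows else 0
            g[i][j] = across + down
    return g
```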

2.2.4  Local  Variation-Feature  3 

The  local  variation  feature  was  chosen  because  manmade  objects  often  show  greater  variation  in 
temperature than natural objects. This feature merely determines the average absolute difference
between each pixel and the mean of the internal region and compares it to the same measurement
for a local background region. The feature is calculated as

$$F^3_{i,j} = L_{in}(i,j) - L_{out}(i,j), \tag{7}$$

where

$$L_{in}(i,j) = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} |f(k,l) - \mu_{in}(i,j)|, \tag{8}$$

and

$$\mu_{in}(i,j) = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} f(k,l), \tag{9}$$

and $L_{out}(i,j)$ and $\mu_{out}(i,j)$ are defined similarly.

2.2.5  Straight  Edge-Feature  4 

The straight edge feature was chosen because manmade objects often display straighter
temperature gradients than natural objects, especially in the near vertical and horizontal
directions. This feature measures the strength of a straight edge that extends for several pixels,
and then determines if the edge values in a target-sized region differ from the local background.
The feature is calculated as

$$F^4_{i,j} = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} H_{in}(k,l) - \frac{1}{n_{out}} \sum_{(k,l) \in N_{out}(i,j)} H_{out}(k,l), \tag{10}$$

where

$$H_{in}(i,j) = H^v_{in}(i,j) + H^h_{in}(i,j), \tag{11}$$

$$H^v_{in}(i,j) = \sum_{|k-i|<l} |f(k,j) - f(k,j+1)|, \tag{12}$$

$$H^h_{in}(i,j) = \sum_{|k-j|<l} |f(i,k) - f(i+1,k)|, \tag{13}$$

and $H_{out}(i,j)$ is defined similarly. The parameter $l$ is a function of field-of-view of the system,
target  range,  and  target  size.  If  the  sensor  or  target  is  significantly  tilted,  the  functions  can  be 
suitably  modified  to  measure  edges  in  other  directions,  at  the  cost  of  more  computation  and  less 
discrimination  ability. 
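The two straight-edge sums of equations (12) and (13) can be sketched as below: a vertical edge accumulates left-right differences along a column, and a horizontal edge accumulates up-down differences along a row. The function and variable names are hypothetical, and treating off-image differences as zero is an assumption.

```python
def straight_edge_images(img, l):
    """Return (vertical, horizontal) straight-edge strength images for edge half-length l."""
    rows, cols = len(img), len(img[0])

    def diff_across(i, j):
        """|f(i,j) - f(i,j+1)|, or 0 where either pixel is off the image."""
        if 0 <= i < rows and 0 <= j < cols - 1:
            return abs(img[i][j] - img[i][j + 1])
        return 0

    def diff_down(i, j):
        """|f(i,j) - f(i+1,j)|, or 0 where either pixel is off the image."""
        if 0 <= i < rows - 1 and 0 <= j < cols:
            return abs(img[i][j] - img[i + 1][j])
        return 0

    # |k - i| < l is the same as k in [i - l + 1, i + l - 1].
    vertical = [[sum(diff_across(k, j) for k in range(i - l + 1, i + l))
                 for j in range(cols)] for i in range(rows)]
    horizontal = [[sum(diff_down(i, k) for k in range(j - l + 1, j + l))
                   for j in range(cols)] for i in range(rows)]
    return vertical, horizontal
```

With `l = 2`, a step edge running down the image contributes three aligned differences to the vertical sum, whereas an isolated hot pixel contributes only one, which is the sense in which the feature rewards straight edges over point-like detail.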

2.2.6  Rectangular  Gradient  Strength-Feature  5 

This  feature  seeks  to  take  advantage  of  nearly  rectangular  target-sized  shapes  by  combining  the 
straight  edge  strengths  that  would  make  up  the  outer  boundary  of  a  roughly  rectangular  target. 
The  straight  edge  strengths  are  defined  as  above. 

The four values to be combined are $H^h(i + l_h, j)$, $H^h(i - l_h, j)$, $H^v(i, j + l_w)$, and $H^v(i, j - l_w)$, which
correspond to the horizontal top and bottom edges of the target, and the right and left vertical
edges. The values $l_h$ and $l_w$ correspond to the half height and half width of the target, respectively.
The  values  are  combined  using  the  Lm  norm. 

2.2.7  Vertical  Gradient  Strength-Feature  6 

This  feature  takes  advantage  of  the  relative  rarity  of  straight  vertical  edges  in  images  taken  at 
nearly horizontal viewing angles. The sides of targets are often the most prevalent vertical edges,
though tree trunks and other features sometimes compete. The straight vertical edge image is
calculated as before,

$$H^v(i,j) = \sum_{|k-i|<l} |f(k,j) - f(k,j+1)|, \tag{14}$$

and the local maximum of $H^v(i,j)$ over a target-sized region is calculated.


2.2.8  How  the  Features  Were  Selected 

A  full  description  of  the  feature  selection  is  outside  the  scope  of  this  report.  A  large  number  of 
features were programmed, and the values of these features were calculated over a large number
of  randomly  selected  pixels  in  the  images  of  the  training  set.  The  feature  values  were  also 
calculated  at  the  ground  truth  location  of  the  targets.  Histograms  were  computed  for  each  of  the 
features  for  both  the  target  and  background  pixels,  and  a  measure  of  separability  was  calculated. 
The  correlation  of  the  features  was  also  calculated  to  avoid  choosing  several  features  that  are 
similar.  Some  of  the  features  were  highly  correlated,  which  was  expected  because  one  of  the 
purposes  of  the  training  was  to  determine  which  of  the  similar  features  provided  the  greatest 
separability.  For  example,  a  number  of  contrast  features  were  used,  which  normalized  the  target 
and  background  values  by  local  standard  deviation  of  the  background,  or  of  the  target,  or  neither. 
Similarly,  a  number  of  gradient  strength  features  were  calculated.  The  feature  pruning  process 
was  ad  hoc,  so  it  would  be  reasonable  to  expect  that  performance  improvement  could  be  obtained 
by a more rigorous approach.

2.3  Combining  the  Features 

Each  feature  is  normalized  across  the  image  so  that  the  feature  value  at  each  pixel  represents  the 
number  of  standard  deviations  from  the  mean  of  that  feature.  Thus  the  normalized  feature  image 
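for the $m$th feature is

$$F^{m,N}_{i,j} = \frac{F^m_{i,j} - \mu_m}{\sigma_m}, \tag{15}$$

where

$$\mu_m = \frac{1}{M} \sum_{\text{all } (k,l)} F^m_{k,l} \tag{16}$$

and

$$\sigma_m = \sqrt{\frac{1}{M} \sum_{\text{all } (k,l)} \left( F^m_{k,l} - \mu_m \right)^2}, \tag{17}$$

with $M$ denoting the number of pixels in the image.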


After normalization, the features, each of which is calculated for each pixel, are linearly
combined into a confidence image,

$$G_{i,j} = \sum_{m=0}^{3} \omega_m F^{m,N}_{i,j}, \tag{18}$$

where the feature weights $\omega_m$ are determined using an algorithm not described here. The
confidence value of each pixel is mapped by a scaling function $S: \mathbb{R} \to [0,1]$, as


l-e 


«G,  / 


(19) 


where $\alpha$ is a constant.

This scaling does not change the relative value of the various pixels; it merely scales them to the
interval  [0,1]  for  convenience.  Confidence  numbers  are  often  limited  to  this  interval  because  they 
are  estimates  of  the  a  posteriori  probability.  While  this  is  not  true  for  our  algorithm,  using  this 
interval  is  convenient  for  evaluators. 
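The per-image normalization and the weighted linear combination described in this section can be sketched as follows. The weights passed in are placeholders, since the report does not describe how they are obtained, and the use of the population (rather than sample) standard deviation is an assumption. The final scaling to [0,1] is omitted here.

```python
import math

def normalize(feature):
    """Express each pixel of a feature image in standard deviations from the image mean."""
    vals = [v for row in feature for v in row]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    if sigma == 0.0:
        # A constant feature image carries no information; map it to all zeros.
        return [[0.0 for _ in row] for row in feature]
    return [[(v - mu) / sigma for v in row] for row in feature]

def confidence_image(features, weights):
    """Weighted sum of the normalized feature images (weights are placeholders)."""
    normed = [normalize(f) for f in features]
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(w * n[i][j] for w, n in zip(weights, normed))
             for j in range(cols)] for i in range(rows)]
```

One consequence of the normalization is that rescaling a feature image by a positive constant leaves the confidence image unchanged, so features measured in very different units can share a single set of weights.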

To determine the detection locations from the scaled confidence image, the pixel with the
maximum confidence value is chosen. Then a target-sized neighborhood around that pixel is set
to  zero  so  that  the  search  for  subsequent  detections  will  not  choose  a  pixel  location 
corresponding  to  the  same  target.  The  process  is  then  repeated  for  the  fixed  number  of  detections 
chosen  before  the  algorithm  was  run. 
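The selection procedure just described amounts to a greedy peak search with suppression. A minimal sketch follows, assuming the scaled confidence image is a list of rows with values in [0,1] and that the target-sized neighborhood is given by hypothetical half-sizes in pixels. Stopping early when no positive confidence remains is an added safeguard, not something stated in the report.

```python
def pick_detections(conf, n_detections, half_h, half_w):
    """Return up to n_detections tuples (row, col, confidence), highest confidence first."""
    conf = [row[:] for row in conf]            # work on a copy; the caller's image is kept
    rows, cols = len(conf), len(conf[0])
    picks = []
    for _ in range(n_detections):
        best, bi, bj = max((conf[i][j], i, j) for i in range(rows) for j in range(cols))
        if best <= 0.0:
            break                              # everything left has been suppressed
        picks.append((bi, bj, best))
        # Zero a target-sized neighborhood so the same target is not chosen again.
        for k in range(max(0, bi - half_h), min(rows, bi + half_h + 1)):
            for l in range(max(0, bj - half_w), min(cols, bj + half_w + 1)):
                conf[k][l] = 0.0
    return picks
```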

2.4  Experimental  Results 

The training results on the Hunter-Liggett April 1992 Region of Interest (ROI) database are
shown in the receiver operating characteristic (ROC) curve in Figure 1. Test results on the July
1992 ROI database collected at Yuma Proving Grounds are shown in Figure 2, and for the Greyling
August 1992 ROI database in Figure 3. The Yuma test images are much more difficult, because
they  were  taken  in  the  desert  in  July,  so  many  locations  in  the  image  have  a  higher  apparent 
temperature  than  the  targets.  The  images  from  Greyling,  MI  are  significantly  easier  because  the 
temperatures  are  more  mild,  and  are  comparable  in  difficulty  to  the  training  data.  Note  that  no 
training  images  were  used  from  anywhere  but  Hunter-Liggett,  so  the  results  suggest  that  the 
algorithm  is  not  sensitive  to  the  training  background.  This  is  not  surprising  given  the  simplicity 
of the algorithm, but such sensitivity is common among learning algorithms. Figures 4 and 5 show a
sample  image  and  the  results  of  the  algorithm  on  the  image.  The  cross  denotes  the  ground  truth 
targets,  and  the  xs  denote  the  detections  on  the  targets.  Detections  are  designated  hits  if  the 
detection  center  falls  anywhere  on  the  actual  target.  Otherwise,  they  are  designated  false  alarms. 
The  top  three  detections,  ranked  by  confidence  number,  are  designated  on  the  image.  The  top 
two  detections  are  hits,  while  the  third  falls  near  the  target  and  is  designated  a  false  alarm. 
Figures  6  and  7  show  another,  somewhat  more  difficult,  image  and  associated  algorithm  results. 
The  top  detection  falls  on  a  target  in  the  bottom  left  of  the  image,  while  the  second  highest 
detection is a false alarm near the center of the image. Although the location looks like a possible
target, it is merely a warm spot on the dirt road.
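The hit and false alarm rule used for scoring can be sketched as follows. Representing each ground truth target as an inclusive pixel box `(r0, r1, c0, c1)` is an assumption made for illustration; the report only states that a detection is a hit when its center falls anywhere on the actual target.

```python
def score_detections(detections, target_boxes):
    """Count hits and false alarms for one frame.

    detections:   list of (row, col) detection centers
    target_boxes: list of (r0, r1, c0, c1) inclusive target extents (assumed format)
    Returns (hits, false_alarms, targets_found).
    """
    hits = 0
    false_alarms = 0
    found = set()
    for i, j in detections:
        on_target = [t for t, (r0, r1, c0, c1) in enumerate(target_boxes)
                     if r0 <= i <= r1 and c0 <= j <= c1]
        if on_target:
            hits += 1
            found.update(on_target)
        else:
            false_alarms += 1
    return hits, false_alarms, len(found)
```

Accumulating the false alarm counts over all frames and dividing by the number of frames gives the horizontal axis of the ROC curves, while the fraction of ground truth targets found gives the vertical axis.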

Figure 1. ROC curve on Hunter-Liggett April 1992 imagery. The horizontal
axis gives the average number of false alarms per frame, the vertical
axis is the target detection probability.


Figure  2.  ROC  curve  on  Yuma  July  1992  imagery. 


Figure 3. ROC curve on Greyling August 1992 imagery.


Figure 6. Moderate image from Hunter-Liggett
April 1992 data set.


Figure  7.  Results  on  previous  image. 


The  algorithm  was  also  tested  on  data  collected  specifically  for  the  DEMO  III  program  to  ensure 
that performance does not degrade because of the different sensor. The DEMO III sensor is
sensitive in the midwave, 3-5 μm region, while the previous data were in the longwave, 8-12 μm
region. Figure 8 shows an ROC curve on data collected by the DEMO III sensor at the Fort
Indiantown Gap DEMO III site.


Figure 8. ROC curve on 12-bit 2001 Fort Indiantown Gap data.


To  determine  if  raw  grey  level  information  could  be  used  to  locate  targets  without  the  use  of 
shape  information,  histograms  of  the  37  Fort  Indiantown  Gap  images  were  formed.  Figure  9 
shows  the  histogram  for  images  that  contain  no  targets,  and  Figure  10  shows  the  histogram 
magnified 100x to show the tail of the distribution. The idea is to determine if the tail for images
with targets is higher than for images without targets. Figures 11 and 12 show the corresponding
histograms  for  images  with  targets.  It  appears  that  the  raw  grey  level  information  would  be  a 
poor  discriminant  for  target  detection. 

The  algorithm  is  being  used  by  the  DEMO  III  program  to  reduce  the  amount  of  imagery  that 
must be transmitted via radio link to a human user. It will also be used by the Sensors for UGV
program at the Night Vision and Electronic Sensors Directorate (NVESD) to prescreen uncooled
FLIR imagery and indicate potential targets that should be looked at more closely with an active
laser sensor. It has also been used in a synthetic image validation tool, which compares the
performance of the algorithm on synthetic imagery with its performance on real imagery.




Figure 12. Histogram of grey levels of Fort Indiantown Gap images with targets.
The y axis has been magnified 100x to show tail of distribution.
3. The Clutter Rejection Algorithm


The  purpose  of  the  clutter  rejection  algorithm  is  to  further  examine  locations  indicated  by  the 
detector to determine if targets are present. Because the clutter rejector does not examine the
whole  image,  the  algorithm  can  be  more  computationally  intensive.  The  algorithm  used  here  is 
based  on  PCA-based  dimensionality  reduction,  followed  by  a  multilayer  perceptron  (MLP) 
trained  to  reject  clutter  and  accept  targets. 

The limited diversity of the training set required that dimensionality reduction be performed
before a neural network is used. This is important because using a learning algorithm without prior
dimensionality reduction requires a large and diverse training set to avoid overtraining, which results
in a sharp difference between training and testing performance. The architecture of the algorithm
has a front-end PCA dimensionality reduction component, followed by a multilayer perceptron
that uses only the individual PCA components as inputs. The output of the MLP, along with the
feature values from the detector, is combined by a higher level MLP. The following sections
describe  the  PCA  and  MLP  components  and  describe  experimental  results. 

3.1  PCA 

Also  referred  to  as  the  Hotelling  transform  or  the  discrete  Karhunen-Loeve  transform,  PCA  is 
based  on  statistical  properties  of  vector  representations.  PCA  is  an  important  tool  for  image 
processing  because  it  has  several  useful  properties,  such  as  decorrelation  of  data  and  compaction 
of  information  (energy).  We  provide  here  a  summary  of  the  basic  theory  of  PCA. 

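Assume a population of random vectors of the form

$$\mathbf{x} = [x_1, x_2, \ldots, x_n]^T. \tag{20}$$

The mean vector and the covariance matrix of the vector population $\mathbf{x}$ are defined as

$$\mathbf{m}_x = E\{\mathbf{x}\}, \text{ and} \tag{21}$$

$$\mathbf{C}_x = E\{(\mathbf{x} - \mathbf{m}_x)(\mathbf{x} - \mathbf{m}_x)^T\}, \tag{22}$$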

where $E\{arg\}$ is the expected value of the argument, and $T$ indicates vector transposition. Because $\mathbf{x}$
is $n$-dimensional, $\mathbf{C}_x$ is a matrix of order $n \times n$. Element $c_{ii}$ of $\mathbf{C}_x$ is the variance of $x_i$ (the $i$th
component of the $\mathbf{x}$ vectors in the population), and element $c_{ij}$ of $\mathbf{C}_x$ is the covariance between
elements $x_i$ and $x_j$ of these vectors. The matrix $\mathbf{C}_x$ is real and symmetric. If elements $x_i$ and $x_j$ are
uncorrelated, their covariance is zero and, therefore, $c_{ij} = c_{ji} = 0$. For $N$ vector samples from a
random population, the mean vector and covariance matrix can be approximated from the
samples by
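
$$\mathbf{m}_x = \frac{1}{N} \sum_{p=1}^{N} \mathbf{x}_p, \text{ and} \tag{23}$$

$$\mathbf{C}_x = \frac{1}{N} \sum_{p=1}^{N} \left( \mathbf{x}_p \mathbf{x}_p^T - \mathbf{m}_x \mathbf{m}_x^T \right). \tag{24}$$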

Because $\mathbf{C}_x$ is real and symmetric, we can always find a set of $n$ orthonormal eigenvectors for
this  covariance  matrix.  Figure  13  shows  the  first  100  (out  of  the  800  possible  in  this  case)  most 
dominant  PCA  eigen-targets  and  eigen-clutters,  which  were  extracted  from  the  target  and  clutter 
chips  in  the  training  set,  respectively.  Having  the  largest  eigenvalues,  these  eigenvectors  capture 
the  greatest  variance  or  energy  as  well  as  the  most  meaningful  features  among  the  training  data. 


Figure  13.  100  most  dominant  PCA  eigenvectors  extracted  from  the  target  chips. 


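Let $\mathbf{e}_i$ and $\lambda_i$, $i = 1, 2, \ldots, n$, be the eigenvectors and the corresponding eigenvalues of $\mathbf{C}_x$, sorted
in descending order so that $\lambda_j \geq \lambda_{j+1}$ for $j = 1, 2, \ldots, n-1$. Let $\mathbf{A}$ be a matrix whose rows are
formed from the eigenvectors of $\mathbf{C}_x$, such that

$$\mathbf{A} = \begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \vdots \\ \mathbf{e}_n \end{bmatrix}. \tag{25}$$

This $\mathbf{A}$ matrix can be used as a linear transformation matrix that maps the $\mathbf{x}$ vectors into vectors
denoted by $\mathbf{y}$, as follows:

$$\mathbf{y} = \mathbf{A}(\mathbf{x} - \mathbf{m}_x). \tag{26}$$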

Conversely, we may want to reconstruct vector $\mathbf{x}$ from vector $\mathbf{y}$. Because the rows of $\mathbf{A}$ are
orthonormal vectors, $\mathbf{A}^{-1} = \mathbf{A}^T$. Therefore, any vector $\mathbf{x}$ can be reconstructed from its
corresponding $\mathbf{y}$ by the relation

$$\mathbf{x} = \mathbf{A}^T \mathbf{y} + \mathbf{m}_x. \tag{27}$$

Instead of using all the eigenvectors of $\mathbf{C}_x$, we may pick only $k$ eigenvectors corresponding to the
$k$ largest eigenvalues and form a new transformation matrix $\mathbf{A}_k$ of order $k \times n$. In this case, the
resulting $\mathbf{y}$ vectors would be $k$-dimensional, and the reconstruction given in equation (27) would
no longer be exact. The reconstructed vector using $\mathbf{A}_k$ is

$$\hat{\mathbf{x}} = \mathbf{A}_k^T \mathbf{y} + \mathbf{m}_x. \tag{28}$$

The mean square error (MSE) between $\mathbf{x}$ and $\hat{\mathbf{x}}$ can be computed by the expression

$$e = \sum_{j=1}^{n} \lambda_j - \sum_{j=1}^{k} \lambda_j = \sum_{j=k+1}^{n} \lambda_j. \tag{29}$$

Because the $\lambda_j$'s decrease monotonically, equation (29) shows that we can minimize the error by
selecting the $k$ eigenvectors associated with the $k$ largest eigenvalues. Thus, the PCA transform is
optimal in the sense that it minimizes the MSE between vectors $\mathbf{x}$ and their approximations $\hat{\mathbf{x}}$.

As  we  can  see  from  figure  13,  only  the  first  few  score  of  the  eigen-targets  contain  consistent  and 
structurally  significant  information  pertaining  to  the  training  data.  These  eigentargets  exhibit  a 
reduction  in  information  content  as  their  associated  eigenvalues  rapidly  decrease.  For  the  less 
meaningful  eigentargets  (say,  the  50th  and  all  the  way  up  to  the  800th)  only  high-frequency 
information is present. In other words, by choosing $k = 50$ in equation (29) when $n = 800$, the
resulting distortion error, $e$, would be small. While the distortion is negligible, there is a 16-fold
reduction in input dimensionality.
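The relationship in equation (29) can be checked numerically on a toy problem. The sketch below uses two-dimensional samples rather than 800-dimensional image chips so that the eigen-decomposition of the 2 x 2 covariance matrix can be written in closed form; the sample values and function names are made up for illustration. It keeps one eigenvector and compares the measured mean square reconstruction error with the discarded eigenvalue.

```python
import math

def pca_2d(samples):
    """Mean, eigenvalues (largest first), and principal unit eigenvector of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    # Covariance entries with the 1/N normalization of equation (24).
    a = sum((x - mx) ** 2 for x, _ in samples) / n
    c = sum((y - my) ** 2 for _, y in samples) / n
    b = sum((x - mx) * (y - my) for x, y in samples) / n
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    half_trace = (a + c) / 2
    root = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + root, half_trace - root
    # Orientation of the principal axis of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    e1 = (math.cos(theta), math.sin(theta))
    return (mx, my), (lam1, lam2), e1

def truncation_mse(samples):
    """Keep only the first eigenvector; return (measured MSE, discarded eigenvalue)."""
    (mx, my), (_, lam2), e1 = pca_2d(samples)
    err = 0.0
    for x, y in samples:
        y1 = e1[0] * (x - mx) + e1[1] * (y - my)       # projection, as in equation (26)
        xr, yr = mx + y1 * e1[0], my + y1 * e1[1]      # reconstruction, as in equation (28)
        err += (x - xr) ** 2 + (y - yr) ** 2
    return err / len(samples), lam2

mse, lam2 = truncation_mse([(0, 0), (1, 1), (2, 2.2), (3, 2.9), (4, 4.1)])
print(mse, lam2)   # the two values agree up to rounding
```

The same argument applied with $n = 800$ and $k = 50$ says that the error incurred is the sum of the 750 smallest eigenvalues, which is small when the eigenvalues fall off rapidly.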

3.2  MLP 

After projecting an input chip to a chosen set of $k$ eigen-targets, the resulting $k$ projection values
are fed to an MLP classifier where they are combined nonlinearly. A typical MLP used in our
experiments has $k + 1$ input nodes (with an extra bias input), several layers of hidden nodes, and
one  output  node.  In  addition  to  full  connections  between  consecutive  layers,  there  are  also 
shortcut  connections  directly  from  one  layer  to  all  other  layers,  which  may  speed  up  the  learning 
process. The MLP classifier is trained to solve a two-class problem, with training output
values  of  ±1.  Its  sole  task  is  to  decide  whether  a  given  input  pattern  is  a  target  (indicated  by  a 
high-output  value  of  around  +1)  or  clutter  (indicated  by  a  low-output  value  of  around  -1).  The 
MLP  is  trained  in  batch  mode  using  Qprop  [7],  a  modified  backpropagation  algorithm,  for  a 
faster  but  stable  learning  course. 
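The forward pass of such a network can be sketched as follows, reduced to a single hidden layer for brevity. The tanh activation is an assumption chosen to match the ±1 training targets (the report does not name the activation function), and all weights shown in the usage are made-up values rather than trained ones.

```python
import math

def mlp_forward(projections, hidden_w, output_w):
    """Output in (-1, 1) of a one-hidden-layer MLP with shortcut connections.

    projections: the k PCA projection values
    hidden_w:    one weight row per hidden node, each of length k + 1 (bias first)
    output_w:    weights over [bias] + projections + hidden outputs; the entries
                 for the projections are the input-to-output shortcut connections
    """
    inputs = [1.0] + list(projections)                 # k + 1 input nodes, bias first
    if any(len(row) != len(inputs) for row in hidden_w):
        raise ValueError("each hidden weight row needs k + 1 entries")
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in hidden_w]
    feed = inputs + hidden
    if len(output_w) != len(feed):
        raise ValueError("output weights must cover the inputs and the hidden nodes")
    return math.tanh(sum(w * x for w, x in zip(output_w, feed)))

# Made-up weights for k = 2: an output near +1 reads as target, near -1 as clutter.
score = mlp_forward([0.5, -0.5], [[0, 1, 0], [0, 0, 1]], [0, 5, 0, 0, 0])
print(score)
```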

Alternatively,  the  eigenspace  transformation  can  be  implemented  as  an  additional  linear  layer 
that  attaches  to  the  input  layer  of  the  simple  MLP  above.  The  resulting  augmented  MLP 
classifier, which is collectively referred to as the PCAMLP network in this paper, consists of a
transformation layer and a back-end MLP (BMLP). When the weights connecting the new input
nodes to the $k$th output node of the transformation layer are initialized with the $k$th PCA
eigenvector, the linear summation at the $k$th transformation output node is equivalent to the $k$th
projection value.
3.3 Experimental Results

The  clutter  rejection  algorithm  was  trained  on  data  collected  at  Aberdeen  Proving  Ground  (APG) 
and  Fort  Knox  in  1999,  and  tested  on  data  collected  at  APG  in  1999  and  Fort  Knox  in  2000.  The 
APG  data  were  divided  so  that  the  training  and  test  data  were  collected  on  different  days,  but  the 
backgrounds  are  similar  because  the  Perryman  test  site  is  rather  uniform.  ROC  curves  for  the 
detection/clutter  rejection  system  are  shown  in  Figure  14.  The  curves  show  that  the  clutter 
rejector, which combines the detector features with the output of the MLP in a higher level
MLP, performs better than the detector alone on the training and test data. Simply using the
output  of  the  MLP  of  the  clutter  rejector  results  in  worse  performance  than  the  detector  alone 
because  the  MLP  does  a  good  job  of  separating  targets  and  clutter  but  a  poor  job  of  estimating 
the  confidence. 


Figure  14.  Performance  curves. 
4. Target Recognition


4.1  Introduction 

The  algorithm  described  here  was  designed  to  address  a  need  for  a  recognition  algorithm  that 
could  be  trained  with  a  small  amount  of  data,  with  poor  range  and  localization  information.  The 
operational  scenario  is  to  examine  objects  that  have  been  detected  by  another  algorithm  to 
determine  if  they  are  one  of  the  objects  stored  in  an  existing  image  library.  The  detection 
location  given  by  the  detection  algorithm  may  be  poorly  centered  on  the  target,  and  the  range  to 
the  target  will  not  be  known.  The  number  of  training  examples  for  the  four  different  targets 
differed  radically.  This  meant  that  the  chosen  algorithm  must  be  able  to  take  advantage  of  a  large 
training  set  when  it  exists  but  still  be  able  to  perform  well  for  smaller  data  sets. 

Many  techniques  have  been  applied  to  the  recognition  problem  [1].  When  training  sets  have  been 
large,  recognition  algorithms  have  typically  used  complex  learning  algorithms  that  use  a  large 
number  of  features  to  discriminate  between  targets.  Often  the  features  are  either  simply  the  pixel 
values,  or  simple  gradient/wavelet  features  calculated  in  a  dense  grid  across  the  target  region 
[3, 4,  8].  The  learning  algorithms  include  complex  template  matching  schemes  [8]  or  neural 
networks  [2-4, 9, 10].  Learning  algorithms  that  are  trained  on  small  data  sets  tend  to  generalize 
poorly,  so  we  chose  not  to  use  these  algorithms  for  this  work. 

Some algorithm designers have used PCA to compress the data prior to recognition [11]. An
advantage  of  this  approach  is  that  it  reduces  the  number  of  features  that  a  classifier  can  use,  and 
thus  reduces  the  size  of  the  required  training  set.  One  disadvantage  is  that  the  compression 
eliminates  some  of  the  information  that  is  useful  to  perform  discrimination,  and  because  the  PCA 
algorithm  optimizes  the  compression  of  the  data  without  regard  for  information  that  is  useful  for 
discrimination,  one  cannot  expect  that  PCA  gives  the  most  discrimination  information  possible 
for  a  given  number  of  features. 

The  data  set  used  for  our  training  was  lopsided.  The  algorithm  attempts  to  recognize  four  targets, 
two real (M113 and HMMWV) and two target boards (TB1 and TB2). For the M113 and
HMMWV,  we  have  1239  and  2080  suitable  training  samples,  whereas  for  the  target  boards  we 
have  14  and  22  suitable  training  samples.  The  ramifications  of  this  imbalance  will  be  discussed. 

The  remainder  of  this  technote  is  organized  as  follows:  Section  2  describes  the  data  used  to  train 
and  test  the  system.  Section  3  describes  the  architecture  of  the  recognizer.  Section  4  gives  the 
results  of  experiments  performed  on  a  small  test  set  of  imagery.  Section  5  contains  conclusions 
and  plans  for  future  work. 

4.2  The  Data 

The  training  and  testing  data  were  gathered  from  various  sources.  The  testing  set  consisted  of 
suitable  images  from  the  Fort  Indiantown  Gap  data  collection  of  2001.  Images  were  selected  that 
met  a  number  of  conditions.  The  images  must  contain  the  target  at  sufficient  resolution  for 
human  recognition.  The  target  images  must  be  nearly  unoccluded.  Fort  Indiantown  Gap  data 
were chosen for testing because the actual DEMO scheduled for September 2001 is to be located
there. Also, these data were taken with the same sensor configuration that will be used for the
DEMO. The test data contained 64 images of the M113 and 45 images of the target boards.

The  training  set  consisted  of  images  taken  with  previous  configurations  of  the  sensor,  or  with 
another  sensor.  Because  the  amount  of  data  from  the  latest  sensor  configuration  was  small,  we 
decided to save all of it for testing, and obtain data from other sources for training. The test on
the  most  appropriate  data  would  let  us  know  if  the  outside  data  were  unsuitable.  The  only  other 
data  of  the  target  boards  were  obtained  with  the  same  sensor,  but  with  an  eight-bit  digitizer.  This 
resulted  in  many  saturated  images;  in  particular,  target  regions  were  likely  to  be  saturated.  Other 
data of the M113 and HMMWV included data from previous versions of the sensor, and from
different sensors that we had on hand. For the training data, we had 14 and 22 images of the
target boards, and 1239 and 2080 images of the M113 and HMMWV.

The  lopsided  training  set  suggests  an  algorithm  architecture  that  can  handle  a  wide  variation  of 
training  set  size  and  variability.  The  numbers  given  above  overstate  the  problem  for  a  couple  of 
reasons.  The  target  boards  are  two-dimensional  plywood  boards  with  attached  heating  panels, 
and as such there is essentially one pose. The real vehicles can be seen from an arbitrary azimuth
angle and with some variation in elevation. The algorithm groups all of these poses into four groups
for the purpose of PCA eigenvector generation. Also, the signature of the target boards is not
nearly as variable as that of the real targets. The target boards have fixed heating panels, so the greatest
variability comes from the angle-of-view variation and the relative temperatures of the heating panels, the
bare plywood, and the background. The solar irradiation on the panel should be nearly constant
because the panel is flat. The real target signatures vary because the amount of solar irradiance
differs on different portions of the target; the exercise state affects different parts differently (hot
wheels if there has been movement, hot engine if the engine is running, regardless of movement,
etc.),  and  the  pose  varies.  Still,  it  can  be  expected  that  the  training  set  of  the  target  boards  does 
not  capture  the  variability  as  well  as  for  the  real  vehicles,  and  it  is  therefore  important  that  a  bias 
reduction  technique  is  used  after  the  PCA  transformation. 

4.3  Algorithm  Architecture 

We  have  chosen  a  PCA  decomposition/reconstruction  technique  for  the  algorithm.  The  idea  is  to 
calculate  a  PCA  decomposition  of  each  target-pose  group  using  the  training  set.  For  testing,  each 
target  is  decomposed  using  the  first  n  PCA  eigenvectors,  then  reconstructed,  and  the  mean  square 
error  (MSE)  of  the  difference  between  the  original  and  reconstructed  target  is  calculated.  This 
gives  one  MSE  value  for  each  target-pose  group.  The  minimum  reconstruction  error  should 
occur  for  the  correct  target-pose  group.  Because  the  PCA  captures  a  different  proportion  of  the 
total  information  for  each  of  the  target-pose  groups,  the  MSE  values  are  adjusted  by  a  weighting 
vector  prior  to  choosing  the  minimum  value.  We  emphasize  that  the  data  set  drives  the  choice  of 
algorithm  architecture. 
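
As a rough illustration of the decomposition/reconstruction idea, the following Python sketch projects a toy input onto the leading eigenvectors of two hypothetical target-pose groups, reconstructs it, and computes the MSE for each group. The function names, the 4-element toy vectors, and the hand-picked orthonormal bases are assumptions made for illustration; they are not taken from the report's implementation, and any mean removal or normalization that the actual recognizer may apply is omitted.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def reconstruction_mse(x, eigenvectors, n):
    """Project x onto the first n eigenvectors, rebuild it, and return the MSE.

    Assumes the eigenvectors are orthonormal and ordered by decreasing
    eigenvalue (an assumption of this sketch).
    """
    basis = eigenvectors[:n]
    # Decomposition: one component per retained eigenvector.
    components = [dot(x, e) for e in basis]
    # Reconstruction: weighted sum of the retained eigenvectors.
    x_hat = [sum(c * e[j] for c, e in zip(components, basis))
             for j in range(len(x))]
    # Mean square error between the original and the reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hypothetical bases for two target-pose groups (toy 4-pixel "images").
group_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
group_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
x = [3.0, 4.0, 0.0, 1.0]

mse_a = reconstruction_mse(x, group_a, n=2)
mse_b = reconstruction_mse(x, group_b, n=2)
```

In this toy case the input lies mostly in the span of group A's basis, so group A gives the smaller error (0.25 versus 6.25), which is the behavior the minimum-error rule relies on.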

The  PCA  decomposition  does  not  capture  all  of  the  information  in  the  input  target,  because  the 
decomposition  is  truncated  at  some  small  number  n  of  eigenvectors.  Also  referred  to  as  the 
Hotelling  transform  or  the  discrete  Karhunen-Loeve  transform,  PCA  is  based  on  statistical 
properties  of  vector  representations.  PCA  is  an  important  tool  for  image  processing  because  it 
has  several  useful  properties,  such  as  decorrelation  of  data  and  compaction  of  information 
(energy).  The  basic  theory  of  PCA  was  described  in  the  clutter  rejection  section. 

4.3.1 PCA Decomposition/Reconstruction Architecture

The  PCA  decomposition  described  above  takes  a  training  set  of  images  and  turns  it  into  an 
ordered  set  of  eigenvectors  and  corresponding  eigenvalues.  This  decomposition  is  performed  for 
each  target-pose  group  of  training  samples.  Because  there  are  two  real  targets,  which  are  divided 
into  four  pose  groups  each,  and  two  target  board  types,  which  only  have  one  pose,  there  are  a 
total  of  10  target-pose  groups.  We  have  chosen  the  number  of  eigenvectors  to  retain  to  be  5,  for  a 
total  of  50  stored  eigenvectors.  The  eigenvectors  are  stored  at  different  scales  to  account  for 
variability  in  the  range  to  potential  targets.  The  eigenvectors  for  each  of  the  10  target-pose 
groups  are  shown  in  Figures  15-24. 
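
The group and eigenvector counts can be checked with a few lines of Python. This is only a bookkeeping sketch: the pose names follow the figure captions (front, left, back, right), and the variable names are hypothetical.

```python
from itertools import product

real_targets = ["HMMWV", "M113"]
poses = ["front", "left", "back", "right"]   # pose names as used in the figure captions
target_boards = ["TB1", "TB2"]               # one pose each
EIGENVECTORS_PER_GROUP = 5

# Each real target contributes one group per pose; each board is its own group.
groups = [f"{target} {pose}" for target, pose in product(real_targets, poses)]
groups += target_boards

total_eigenvectors = len(groups) * EIGENVECTORS_PER_GROUP
```

This gives 2 x 4 + 2 = 10 groups and 10 x 5 = 50 eigenvectors, matching the totals stated in the text.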


Figure  15.  Eigenvectors  of  HMMWV  front  side. 


Figure 16. Eigenvectors of HMMWV left side.


Figure 21. Eigenvectors of M113 back side.


Figure 22. Eigenvectors of M113 right side.


Figure  23.  Eigenvectors  of  target  board  1. 


Figure  24.  Eigenvectors  of  target  board  2. 

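The decomposition stage determines the PCA components of a sample being tested. The
components $y_i$ for an input target image $x$ are calculated as

$$y_i = \sum_{j=1}^{n} e_{ij} x_j ,$$

where $e_{ij}$ is the $j$th of the $n$ elements of the $i$th eigenvector $e_i$. Thus, the $i$th PCA
component of an input vector $x$ is simply the dot product of $x$ with the $i$th
eigenvector. The reconstruction using the first $k$ eigenvectors is

$$\hat{x} = \sum_{i=1}^{k} y_i e_i .$$

The reconstruction error is simply the mean square difference between the input and its reconstruction,

$$\varepsilon = \frac{1}{n} \sum_{j=1}^{n} (x_j - \hat{x}_j)^2 .$$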


4.3.2  Linear  Weighting  of  Reconstruction  Error 

To reduce the bias inherent in the PCA decomposition process, the reconstruction error for each
target-pose group is multiplied by a fixed weight. Thus, the reconstruction error for the $i$th
target-pose group, $\varepsilon_i$, is weighted by a weight $\omega_i$. The target-pose decision $\hat{i}$ is given by

$$\hat{i} = \arg\min_i (\omega_i \varepsilon_i) .$$ (33)

The weights $\omega_i$ were chosen by experiment.
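
The weighted decision rule can be sketched in Python as follows. The error and weight values are invented for illustration only; the report states that the actual weights were chosen by experiment and does not list them.

```python
def choose_group(errors, weights):
    """Return the index i that minimizes weights[i] * errors[i]."""
    if len(errors) != len(weights):
        raise ValueError("errors and weights must have the same length")
    weighted = [w * e for w, e in zip(weights, errors)]
    return min(range(len(weighted)), key=weighted.__getitem__)


# Hypothetical reconstruction errors for three target-pose groups.
errors = [0.20, 0.10, 0.30]
# Hypothetical weights; a weight above 1 penalizes a group whose
# reconstruction errors tend to be low for every input.
weights = [1.0, 2.5, 0.5]

decision = choose_group(errors, weights)
unweighted_decision = choose_group(errors, [1.0] * len(errors))
```

With these made-up numbers the unweighted minimum falls on group 1, while the weighted rule selects group 2, showing how the weights can change the decision when one group reconstructs most inputs well.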

4.3.3 Scale and Shift Search Space

It  is  anticipated  that  this  recognizer  will  be  used  after  an  automated  detector  has  found  potential 
targets  in  an  image.  It  must  be  assumed  that  any  detector  will  be  imprecise  about  centering  the 
detection  on  the  target.  For  the  DEMO  III  application,  the  range  to  the  target  is  also  unknown,  at 
least  for  some  of  the  scenarios.  Any  template  matching  algorithm  is  inherently  sensitive  to 
translation  and  scale  of  the  image.  The  algorithm  was  written  to  allow  the  user  to  specify  the 
range uncertainty, as well as the translation uncertainty. Accurate range or translation information, when available,
improves algorithm performance. However, inaccurate information will degrade
performance more than a lack of information.

The  algorithm  handles  this  uncertainty  by  performing  the  decomposition/reconstruction 
operation at a number of different scales and at a few locations around the pixel indicated by the
detector.  Iterating  through  possible  ranges  and  target  locations  increases  the  probability  that  a 
false  target-pose  will  give  a  minimum  reconstruction  error.  The  translation  uncertainty  is 
specified  in  pixels.  The  user  is  required  to  specify  a  minimum  and  maximum  range;  if  nothing  is 
known  about  the  range  to  a  target,  the  minimum  and  maximum  ranges  can  be  derived  from 
knowledge  of  the  sensor  and  knowledge  of  the  minimum  resolution  required  by  a  recognizer. 
Range  information  can  be  derived  from  digital  maps,  shape  from  motion  algorithms,  or  laser 
ranging.  It  is  anticipated  that  for  the  current  implementation,  digital  maps  will  be  the  only  regular 
source  of  range  information. 
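
A minimal sketch of the scale and shift search, in Python, is given below. It assumes the range limits have already been converted to a list of candidate scales (that conversion depends on the sensor and is not specified here), and it uses a made-up error function in place of the weighted reconstruction error. All names are hypothetical.

```python
from itertools import product


def candidate_hypotheses(center, shift_px, scales):
    """List every (x, y, scale) hypothesis within +/- shift_px of center."""
    cx, cy = center
    offsets = range(-shift_px, shift_px + 1)
    return [(cx + dx, cy + dy, s)
            for dx, dy, s in product(offsets, offsets, scales)]


def best_hypothesis(hypotheses, error_fn):
    """Return the hypothesis with the smallest error."""
    return min(hypotheses, key=error_fn)


def toy_error(hypothesis):
    """Stand-in for the weighted reconstruction error (illustrative only)."""
    x, y, s = hypothesis
    return (x - 101) ** 2 + (y - 99) ** 2 + (s - 1.25) ** 2


hypotheses = candidate_hypotheses(center=(100, 100), shift_px=1,
                                  scales=[1.0, 1.25, 1.5])
best = best_hypothesis(hypotheses, toy_error)
```

Even this small example produces 3 x 3 x 3 = 27 hypotheses per target-pose group. The count grows quadratically with the translation uncertainty and linearly with the number of scales, which is why a larger search space raises both the computation time and the chance that a false target-pose gives the minimum error.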

4.4  Experimental  Results 

A  detection  algorithm  described  elsewhere  [12,  13]  was  applied  to  the  test  imagery.  The 
recognition  algorithm  takes  as  input  the  original  image  and  the  detection  file  produced  by  the 
detector.  The  recognizer  was  not  given  the  ground  truth  center  of  the  targets,  only  the  detector 
estimated  center.  Table  1  shows  the  confusion  matrix  on  the  four  class  problem.  The  overall 
probability  of  correct  identification  is  59.63%. 
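
As a quick consistency check on the reported figure, the short Python calculation below converts the percentage back into a count of correct decisions. It assumes that each of the 64 M113 and 45 target board test images contributes exactly one decision, which the text does not state explicitly.

```python
m113_images = 64
target_board_images = 45
total_images = m113_images + target_board_images

reported_pct = 59.63
# Nearest whole number of correct identifications implied by the percentage.
correct = round(reported_pct / 100 * total_images)
recomputed_pct = round(100 * correct / total_images, 2)
```

Under that assumption the reported 59.63% corresponds to 65 correct identifications out of 109. The entries 41, 10, and 14 in Table 1 sum to the same count, which would be consistent with their being the diagonal of the confusion matrix.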

Table 1. Confusion matrix on test set.

Row label      Entries, in order
[illegible]    6, 41, [illegible], 14
TB1            1, 11, 10, 1
TB2            6, 2, 14

(Of the column headings, only TB2 is legible.)

Figure  25  shows  a  sample  image  that  does  not  contain  a  target.  Figures  26-29  contain  targets. 
Some  of  the  targets  would  be  difficult  for  a  human  to  distinguish.  The  target  in  Figure  26  is 
difficult to distinguish because the shape is not clearly that of an M113, and there is little interior
information because the whole target is hot. Figure 27 clearly shows an M113 because the rectangular
plate on the upper front of the target is a distinguishing characteristic. Notice the target is not
level; this makes recognition more difficult because the templates are not well aligned. The
algorithm does not currently tilt the templates to handle such a case. Doing so would make it more
likely  to  correctly  identify  tilted  targets  but  would  increase  the  probability  of  error  on  level 
targets,  and  would  increase  computation  time.  Figure  28  is  a  good  example  of  a  target  that  is 
difficult  for  an  algorithm  to  detect  but  easy  for  a  human.  The  target  does  not  have  a  clear 
boundary,  nor  is  it  hotter  than  its  background.  The  detector  and  recognizer  give  correct  results 
for  this  image,  but  that  is  unusual.  Figure  29  shows  a  target  board  type  II  clearly  visible  in  the 
left  center  of  the  image.  While  this  is  clearly  a  target  board,  it  is  not  easy  to  see  which  type  at  this 
resolution. 

Figure 26. An image of the left side of an M113.


Figure 27. Side view of an M113.


Figure 28. Front view of an M113, on the road
near the center of the image.


Figure  29.  View  of  target  board  type  II. 


5.  Conclusions  and  Future  Work 


There is a great deal of work that can still be done to improve the system described. Future work
on the detector might include a more systematic evaluation of potential features and an improved
classification scheme that can incorporate features that are useful but appear only rarely. In a small
minority of FLIR images of targets, a windshield will reflect cold sky, causing a few pixels to be
extremely dark. The current scheme is not set up to incorporate such features because the
weighting would be quite low since the feature is seldom useful. The recognizer would greatly
benefit  from  a  balanced  training  set,  which  would  allow  for  a  more  sophisticated  bias  reduction 
scheme  and  would  enable  the  formation  of  a  better  PCA  representation  of  each  target.  The 
algorithms  could  benefit  from  more  input  information.  All  of  the  components  would  benefit  from 
more  accurate  range  information,  which  could  be  obtained  using  accurate  registration  to  digital 
maps,  from  structure  and  motion  algorithms,  or  from  a  laser  range  finder.  The  UGV  has  a  color 
TV  camera  collocated  with  the  FLIR,  which  could  provide  additional  target  screening  capability 
during  the  day. 